XML example and exercise

study examples of accessing nodes in XML tree structure
work on exercise to be completed and submitted

reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
data source: http://www.dbis.informatik.uni-goettingen.de/Mondial



In [1]:

    
from xml.etree import ElementTree as ET

XML example

for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html



In [2]:

    
document_tree = ET.parse( './data/mondial_database_less.xml' )



In [3]:

    
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text









    



Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra



In [4]:

    
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print capitals_string[:-2]









    



* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

10 countries with the lowest infant mortality rates
10 cities with the largest population
10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [5]:

    
import pandas as pd
import numpy as np

document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

Question 1



In [6]:

    
country,mortality = [],[]
for c in root.findall('country'):
    country.append(c.find('name').text)
    try:
        mortality.append(float(c.find('infant_mortality').text))
    except:
        mortality.append(np.nan)

mortalities = pd.concat({'country':pd.Series(country),
                      'mortality':pd.Series(mortality)},axis=1)
mortalities.sort('mortality',ascending=True).head(10)









    Out[6]:






  
    
      
      country
      mortality
    
  
  
    
      38
      Monaco
      1.81
    
    
      98
      Japan
      2.13
    
    
      117
      Bermuda
      2.48
    
    
      36
      Norway
      2.48
    
    
      106
      Singapore
      2.53
    
    
      37
      Sweden
      2.60
    
    
      10
      Czech Republic
      2.63
    
    
      78
      Hong Kong
      2.73
    
    
      79
      Macao
      3.13
    
    
      44
      Iceland
      3.15

Question 2



In [7]:

    
country, city, population = [], [], []
for c in root.findall('./country/city'):
    country.append(c.attrib['country'])
    city.append(c.find('name').text)
    try:
        population.append(int(c.find('population').text))
    except:
        population.append(np.nan)

cityPops = pd.concat({'country':pd.Series(country),
                      'city':pd.Series(city),
                      'population':pd.Series(population)},axis=1)
cityPops.sort('population',ascending=False).head(10)









    Out[7]:






  
    
      
      city
      country
      population
    
  
  
    
      176
      Seoul
      ROK
      10229262
    
    
      128
      Hong Kong
      HONX
      7055071
    
    
      164
      Al Qahirah
      ET
      6053000
    
    
      80
      Bangkok
      THA
      5876000
    
    
      92
      Ho Chi Minh
      VN
      3924435
    
    
      177
      Busan
      ROK
      3813814
    
    
      216
      New Taipei
      RC
      3722082
    
    
      89
      Hanoi
      VN
      3056146
    
    
      163
      Al Iskandariyah
      ET
      2917000
    
    
      215
      Taipei
      RC
      2626138

Question 3

Processing



In [8]:

    
population, country, year = [], [], []
for c in root:
    current = c.find('name').text
    for pop in c.findall("population"):
        country.append(current)
        year.append(int(pop.attrib['year']))
        try:
            population.append(int(pop.text))
        except:
            population.append(np.nan)
pops = pd.concat({'country':pd.Series(country),
                      'population':pd.Series(population),
                      'year':pd.Series(year)},axis=1)
pops.head(10)



In [9]:

    
#ref: http://stackoverflow.com/questions/27488080/
pops = pops.groupby('country').apply(lambda g: g[g.year==g.year.max()])
pops = pops.reset_index(drop=True)
pops.head()









    Out[9]:






  
    
      
      country
      population
      year
    
  
  
    
      0
      Afghanistan
      26023100
      2013
    
    
      1
      Albania
      2800138
      2011
    
    
      2
      Algeria
      37062820
      2010
    
    
      3
      American Samoa
      55519
      2010
    
    
      4
      Andorra
      78115
      2011



In [10]:

    
country,group,percentage = [], [], []

for c in root:
    current = c.find('name').text
    for pop in c.findall("ethnicgroup"):
        country.append(current)
        group.append(pop.text)
        percentage.append(float(pop.attrib['percentage']))

groups = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(group),
                      'percentage':pd.Series(percentage)},axis=1)

groups.head()









    Out[10]:






  
    
      
      country
      name
      percentage
    
  
  
    
      0
      Albania
      Albanian
      95.0
    
    
      1
      Albania
      Greek
      3.0
    
    
      2
      Greece
      Greek
      93.0
    
    
      3
      Macedonia
      Macedonian
      64.2
    
    
      4
      Macedonia
      Albanian
      25.2



In [11]:

    
combined = pd.merge(groups,pops,on="country")
combined['subpop'] = (combined.percentage*combined.population)//100
combined.head()









    Out[11]:






  
    
      
      country
      name
      percentage
      population
      year
      subpop
    
  
  
    
      0
      Albania
      Albanian
      95.0
      2800138
      2011
      2660131
    
    
      1
      Albania
      Greek
      3.0
      2800138
      2011
      84004
    
    
      2
      Greece
      Greek
      93.0
      10816286
      2011
      10059145
    
    
      3
      Macedonia
      Macedonian
      64.2
      2059794
      2011
      1322387
    
    
      4
      Macedonia
      Albanian
      25.2
      2059794
      2011
      519068

Answer:



In [12]:

    
combined.groupby('name').sum().sort('subpop',ascending=False).subpop.head(10)









    Out[12]:





name
Han Chinese    1245058800
Indo-Aryan      871815583
European        494872201
African         318325104
Dravidian       302713744
Mestizo         157734349
Bengali         146776916
Russian         131856989
Japanese        126534212
Malay           121993548
Name: subpop, dtype: float64

Question 4



In [13]:

    
# a)
country, name, length = [], [], []
for r in root.findall('.//river'):
    country.append(r.attrib['country'])
    name.append(r.find('name').text)
    try:
        length.append(int(r.find('length').text))
    except:
        length.append(np.nan)

rivers = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'length':pd.Series(length)},axis=1)
rivers.sort('length',ascending=False).head(1)









    Out[13]:






  
    
      
      country
      length
      name
    
  
  
    
      174
      CO BR PE
      6448
      Amazonas



In [14]:

    
# b)
country, name, area = [], [], []
for l in root.findall('.//lake'):
    country.append(l.attrib['country'])
    name.append(l.find('name').text)
    try:
        area.append(int(l.find('area').text))
    except:
        area.append(np.nan)

lakes = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'area':pd.Series(area)},axis=1)
lakes.sort('area',ascending=False).head(1)









    Out[14]:






  
    
      
      area
      country
      name
    
  
  
    
      54
      386400
      R AZ KAZ IR TM
      Caspian Sea



In [15]:

    
# c) (Answer here appears to be a bit out of date)
country, name, elevation = [], [], []
for a in root.findall('.//airport'):
    country.append(a.attrib['country'])
    name.append(a.find('name').text)
    try:
        elevation.append(int(a.find('elevation').text))
    except:
        elevation.append(np.nan)

airports = pd.concat({'country':pd.Series(country),
                      'name':pd.Series(name),
                      'elevation':pd.Series(elevation)},axis=1)
airports.sort('elevation',ascending=False).head(1)









    Out[15]:






  
    
      
      country
      elevation
      name
    
  
  
    
      80
      BOL
      4063
      El Alto Intl

	country	population	year
0	Albania	1214489	1950
1	Albania	1618829	1960
2	Albania	2138966	1970
3	Albania	2734776	1980
4	Albania	3446882	1990
5	Albania	3249136	1997
6	Albania	3304948	2000
7	Albania	3069275	2001
8	Albania	2800138	2011
9	Greece	1096810	1861

	country	mortality
38	Monaco	1.81
98	Japan	2.13
117	Bermuda	2.48
36	Norway	2.48
106	Singapore	2.53
37	Sweden	2.60
10	Czech Republic	2.63
78	Hong Kong	2.73
79	Macao	3.13
44	Iceland	3.15

	city	country	population
176	Seoul	ROK	10229262
128	Hong Kong	HONX	7055071
164	Al Qahirah	ET	6053000
80	Bangkok	THA	5876000
92	Ho Chi Minh	VN	3924435
177	Busan	ROK	3813814
216	New Taipei	RC	3722082
89	Hanoi	VN	3056146
163	Al Iskandariyah	ET	2917000
215	Taipei	RC	2626138

	country	population	year
0	Afghanistan	26023100	2013
1	Albania	2800138	2011
2	Algeria	37062820	2010
3	American Samoa	55519	2010
4	Andorra	78115	2011

	country	name	percentage
0	Albania	Albanian	95.0
1	Albania	Greek	3.0
2	Greece	Greek	93.0
3	Macedonia	Macedonian	64.2
4	Macedonia	Albanian	25.2